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Abstract 

We present a GPU parallel implementation of the numeric integration of the Vlasov 
equation in one spatial dimension based on a second order time-split algorithm with 
a local modified cubic-spline interpolation. We apply our approach to three different 
systems with long-range interactions: the Hamiltonian Mean Field, Ring and the 
self-gravitating sheet models. Speedups and accuracy for each model and different 
grid resolutions are presented. 
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1 Introduction 

Systems with long-range interactions are particularly important in physics, 
Coulomb forces being probably the most prominent example. Albeit their 
relevance, many of its properties are still not well understood. The long- 
range nature of the interaction leads to some interesting phenomena not ob- 
served for short-range interactions, such as the existence of quasi-stationary 
non-Gaussian states with diverging life-times with the number of particles, 
negative microcanonical heat capacity, inequivalence of ensembles and non- 
ergodicity [1,2.3 4.5.6 7]. Examples of systems with long range forces include 
self-gravitating systems [9], non-neutral plasmas [TUfTl] and models as the ring 
model [T2|fl~3| Hamiltonian Mean Field (HMF) |14j . one- dimensional gravity 
(infinite uniform density sheets) [T5lll6|ll7| . Free Electron Laser [18] and plasma 
single wave models [115], among others. For out of equilibrium situations, many 
of these studies rely on molecular dynamics simulations, i. e. solving numeri- 
cally the Hamiltonian equations of motion for the A-particle system. It is also 
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a well known fact that, under suitable conditions, the statistical description 
of the dynamics of long range interacting systems is equivalent to the Vlasov 
equation |Tf2U| . The numerical solution of the Vlasov equation was applied to 
the HMF model in Ref. [21 J and more recently to characterize non-equilibrium 
phase-transitions in the same model [2 2l23| . although the phase diagram is still 
open to a closer scrutiny |24| . 

One-dimensional models are important for a better understanding of many 
properties of long-range interacting systems. Therefore a fast numeric im- 
plementation of the solution of the Vlasov equation is of uttermost value in 
their investigation. Numerical solutions of the Vlasov equation are obtained 
either by Particle In Cell (PIC) methods [23], where the distribution func- 
tion is represented by a collection of macro-particles under the dynamics of 
the self-consistent mean-field force, or Eulerian methods where the distribu- 
tion is represented as the density of a non-compressible fluid on a numerical 
grid [26J. For higher dimensional models, PIC methods are more effective in 
computational effort, even though its applicability is limited by inherent sta- 
tistical noise and a poor description of the tails of the distribution. On the 
other hand, Eulerian methods are limited at higher dimensions by the num- 
ber of grid points required to accurately represent the distribution function 
(see [31,32,33] for a comparison of different Eulerian codes). With the rapid 
increase in computational power and the use of parallel machines Eulerian 
codes have been implemented up to two spatial dimensions |2*7f2 8,29,30j. 

We present in this paper an implementation in the CUDA framework [34J of a 
semi-Lagrangian solution method for the Vlasov equation in one dimensional 
systems with long range interactions, and applications to the ring, HMF and 
self-gravitating sheet models. In this approach the distribution function is 
represented on a numerical grid and a times-split algorithm is used to evolve 
the function by computing the characteristic curves and equating the value 
of the solution to its value at the foot of the characteristic [23.35.36j. This 
last step requires an interpolation scheme, and a common choice is to use a 
cubic spline, which is global on the grid due to the requirement to compute 
second order derivatives of the distribution function at grid points [37] . As a 
consequence its parallel implementation is of limited efficiency An alternative 
in Ref. |38] is to use a local spline on patches (tiles) in the grid, with the 
continuity of first derivatives at the borders of each patch. The values of the 
distribution at the grid points on each patch is stored in shared memory, 
and all steps are then performed on a patch-by-patch basis. This approach 
requires communication between processors handling different patches, which 
can be reduced by suitably restricting the time step |36| . Here we implement 
a different approach where the interpolation relies on the same form of cubic 
spline with second order derivatives computed from an eighth order finite 
difference method. 
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The structure of the paper is as follows: in section [2] we present the one- 
dimensional models to which our approach is applied, and section [3] presents 
and discusses the algorithms implemented in CUDA for the solution of the 
Vlasov equation. Section [4] presents the results obtained from the implemen- 
tation of the algorithm to the one-dimensional models, and speedups relative 
to a serial code. We conclude the paper with some concluding remarks in 
section HI 



2 One-dimensional models 



Since a detailed direct study of real three-dimensional systems with long- 
range interactions is a very difficult task, some simplified models have been 
introduced in the literature retaining qualitative features of realistic long-range 
systems (see [Tj and references therein) . The Hamiltonian of a one-dimensional 
model of N identical particles with unit mass can be written as: 

1 N i N 

i=i «<j=i 



where is the potential energy between particles i and j and the factor N^ 1 
ensures extensivity of the energy and corresponds to a change of time units. 
The three different models considered here correspond to different choices for 
Vij. The Ring model describes a system of N identical particles on a ring 
or radius R interacting through their gravitational attraction [T2|13j . With a 
choice of units, the interacting potential is given by 

Vij = F / 1 =^=' ( 2 ) 

V2Jl-coa(9i-6i) + e 



where 6{ is an angle coordinate specifying the position of particle % on the 
circle, and e is a (small) softening parameter used to avoid the divergence of 
the potential at zero distance. For increasing e the ring model tends to the 
HMF model, with potential [Tj] : 

= 1 - cos (6i -9j). (3) 



The third system considered is the sheet model formed by iV identical infinite 
self-gravitating parallel sheets of constant mass density [T6lll7| . he gravita- 
tional force between two sheets is therefore constant and they are allowed to 
cross each other, when the force changes sign. Considering only the motion in 
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the direction x perpendicular to the sheets, and again with a choice of units, 
the pair interaction potential is written as: 



Vij \Xi Xj \ . (4) 



The Vlasov equation for those models is thus: 
• df df df 

f= dt + dx V+ dv FM = ^ (5) 



where / = f(x,v,t) is the one particle distribution function, v the velocity of 
the particle, x the position coordinate {9 for the HMF and ring models), and 
the mean- field force F(x,t): 

F(x, t) = J v(x - x')f(x', v', t) dx'dv', (6) 



with v(x — x') given in eqs. (|2]-[4j). For the HMF model the mean-field force 
can be written as: 

F(6, t) = - sm{6)M x + cos(6)M y , (7) 



where the components of the "magnetization" vector M are then 

M x = J cos(9)f(9,v,t)dxdv, M y = J sin(0)/(0, v, t) dx dv. (8) 

This property implies that molecular dynamics simulation times for the HMF 
model with N particles scale as N instead of iV 2 , and is one the reasons why 
it is so extensively studied. 



3 Algorithms and CUDA implementation 



The semi-Lagrangian scheme used here is described in References |23|35II36| 
and can be summarized as follows. The one-particle distribution function is 
represented in a numerical grid in the one-particle phase space as f(xi,pj,t) 
where Xi and pj are position and velocity coordinates of the points in the 
grid on a finite domain x G [x m i n , x max ] and p G \p m iniPmax\- The distribution 
function at time t + At is obtained numerically by evolving the function back 
in time and using the invariance of / along the characteristic lines. This back- 
wards evolution is performed using a a time-split method. The mains steps 
are: 
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(1) Backwards time evolution (advection) of / in the spatial direction by a 
time step At/2 with constant momentum: 

f^(x,p) = f(x-pAt/2,p,t). (9) 

(2) Computation of the mean-field force using f^: 

FW(x) = - J ~v{x - x')^^' ,p') dx' dv' . (10) 

(3) Backwards time evolution in the momentum direction by a full time step 
At using x : 

f II \x,p) = f{x,p-F^\x)At,t). (11) 

(4) And as last step repeat ([I]): 

f(x,p, t + At) = f n \x - p At/2,p, t). (12) 

The values of the intermediate f^ 11 ^ and final distribution functions in 
steps ([!]), ([3]) and Q at the numerical grid points must be obtained from the 
known values at the previous step with a cubic spline as interpolation method. 
For a point with coordinate x, Xi < x < Xi+i, the interpolated value for f(x) 
knowing = f(xi) and f i+1 = f(x i+ i) is given by [37]: 

f(x) = af + (3f i+1 + 7 /r + Sfh, (13) 
where 



Xi+l — x 

a = , p — 1 — a, 

Xi+l Xi 

7 = a a (x i+1 - Xi) 2 , 5 = - - Xi) 2 , (14) 

o 

and /■' stands for the second derivative of / at X{. BY requiring that first 



order derivatives computed from eq. (13) are continuum across the boundaries 



of neighboring intervals, we obtain a tridiagonal system of equations for f": 

I Ji+1 1 Ji 1 Ji— 1 



6 3 6 



Ax 2 = f l+1 -2f + f l . u z = 0,...n, (15) 



with n the number of points in the corresponding direction. Even though 
useful in a sequential context, a direct efficient parallel implementation of the 



solution of system (15) is not effective enough. 



In order to compute the second order derivatives locally, i. e. involving only 
a small number of neighbor points, with good accuracy in order not to spoil 
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the quality of the cubic interpolation an eighth order centered finite difference 
approximation is used [40J: 



8 18 1 

+-f(u i+1 ) - -f(u i+2 ) + ^5/(^+3) - ^f(u i+ 4), (16) 



where u stands for either the momentum or position variables. Periodic bound- 
ary conditions are used both in the spatial and momentum direction. Unphys- 
ical effects are avoided by choosing the size of the domain sufficiently large. 

The distribution function is represented on an equally spaced grid (xi,pj), with 
n x x n p points, by a one-dimensional array f(xi,pj) — > f[j-n x +i], % — 0, . . . , n x 
and j = 0, . . . ,n p . The second order derivatives are represented similarly. The 
initial condition array is loaded in global GPU memory and all subsequent 
operations are performed there. Memory bandwidth is an important issue for 
efficiency is this memory intensive application, and the algorithm must ex- 
ploit as much as possible coalesced memory access. Reading and writing in 
the one-dimensional array in the x direction tends to be coalesced, but not on 
the p direction. This is an issue when computing the spline coefficients (the 
second order derivatives) necessary for the interpolation in the spatial advec- 
tion, but is avoided in the other parts of the advection process. To overcome 
this difficulty a global transpose of the / array is performed before computing 
/", and then another transpose is performed it the array /". As a consequence 
the same routine is used to compute the derivatives for both spatial and mo- 
mentum directions. The transpose of a bidimensional array written in the 
one-dimensional form can be performed efficiently close to full bandwidth in 
CUDA [39J. Our algorithm is synthesized as: 



(ii 
(iii 

(iv 

(v 
(vi 
(vii 
(Viii 



Transpose /; 
Compute /"; 
Transpose /"; 

Perform a spatial advection (step [T] above) using /" obtained in step (JTTJ) 
for the spline interpolation; 

Compute the mean-field force (step [2] above); 
Compute /"; 

Perform a momentum advection (step [3]) using /" from step 
Repeat steps ((iffyl) (step El). 



vi 



The computation of the force in step Q is implemented by a discretization 
of eq. 
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n p = n x 



256 


7 x 10~ 3 


10~ 3 


io- 2 


10~ 3 


512 


7 x 10" 4 


io- 4 


IO" 4 


1.7 x 10~ 5 


1024 


10~ 5 


3 x 10~ 6 


2 x 10~ 6 


9 x 10~ 7 


2048 


4 x 10~ 6 


io- 8 


3 x IO" 6 


io- 8 



Table 1 

Maximum relative error for the energy (<5e) and norm {SNorm) for the Finite Dif- 
ference (FD) approximation in eq. (16) and Global Spline (GS), with A = 0.01. 

4 Results and discussions 



The simulations presented here were performed on a GTX 560 Ti GPU with 
384 cores, 1GB global memory and 1,64 GHz clock speed, and a GTX 590 
GPU with 512 cores, 1,5 GB global memory and 1.26 GHz clock (in fact the 
GTX 590 has two identical devices but only one was used for the runs). The 
CPU has an i7-2600 processor with 3.4GHz clock and 16GB of RAM. In this 
section we present and discuss the results of simulations for the three one- 
dimensional models presented in section [2} All computations performed use 
double precision. 



4-1 Self- gravitating sheet model 



As a first test case let us apply our approach to the self-gravitating sheet model 
with pair interaction potential given by eq. Q, with a (waterbag) constant 
distribution in an interval as initial condition, i. e: 

f(x,p,t = 0) = l/4p x Q , if -x <x<x , and-p <p<po- (17) 



The parameters of the numeric grid are x max = —x m i n = 2.0, p m ax = —Pmin = 
2.0 and n p = n x = 256, 512, 1024, 2048 (different number of points in each 
direction can also be used). The waterbag initial condition is chosen with 
Xq = 1.0 and p = 0.5. Two time steps A = 0.1 and At = 0.01 were considered 
to assess numerical errors. Figure [T] shows some snapshots of the time evolution 
of the distribution function obtained from our code. 

Tables [T] and [2] present the relative errors for the energy 5e and total norm 
SNorm. The accuracy of our approach is similar to the a global spline as 
describe in |37j . for all cases considered. The speedups obtained by comparing 
our parallel code to a CPU serial version using the same interpolation method 
are shown in table [3j The speedups grow with n p and n x for two main reasons. 
First not all the latencies of the GPU are covered with a small number of grid 
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Fig. 1. Snapshots of the distribution function for the self-gravitating sheet model 



with x r 



2.0, Pn 



2.0, n„ 



77,, 



2048, At = 0.01, a wa- 



terbag initial condition with x = 1.0 and p = 0.5, and t = 0, 10, 100, 200, 500, 1000. 
In each graphic p and x correspond to the horizontal and vertical axis respectively 



Tip — n x 


5e^ 


5Norm^ 


6e (GS) 


5iVorm (GS) 


256 


5 x 10" 4 


4 x 10~ 5 


2 x 10~ 4 


4 x 10~ 5 


512 


4 x 10~ 4 


2 x 10~ 7 


4 x 10~ 4 


2 x 10" 7 


1024 


3 x 10~ 4 


3 x lO" 10 


3 x 10~ 4 


io- 10 


2048 


3 x 10" 4 


8 x 10" 13 


3 x 10" 4 


io- 14 



Table 2 

Same as table [T] with At = 0.1. 



77 p = n x GTX 570 Ti GTX 590 

256 17 20 

512 25 31 

1024 38 51 

2048 51 71 

Table 3 

Speedups for the self-gravitating sheet model for different grid resolutions. 



points. And second, with smaller grid spacings the possibility of coalesced 
access to global memory is significantly increased. 
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Fig. 2. Entropy for the sheet model for different grid resolutions. The runs are the 
same as in table [2] 

The Vlasov dynamics has an infinite number of invariants, called Casimirs, of 
the form 



C[8] 



s(f(p, x, £)) dp dx. 



(18) 



This fact can be used to asses how information on the initial condition is 
lost due to the finite resolution of the numerical grid and other sources of 
errors. For this purpose we consider the entropy of the distribution / given 
by s{f) = —/log / in eq. (18). Figure [2] shows the time dependence of S 



for different grid resolutions. Information loss starts to be significant when 
filamentation is of the order of the grid spacing. These non- Vlasov effects 
are inherent to Eulerian solvers and must be considered with due care in 
the numerical solutions of the Vlasov equation |2T|41f42|l43j . For the highest 
resolution (2024 x 2024), there are are two plateaus, one at the initial stage, 
before the formation of filamentation, and another at the final stage, when 
details smaller than the grid resolutions were lost. 



4.1.1 The HMF model 

For the HMF model as defined by the pair interaction potential in eq. ^ 
Molecular Dynamics (MD) simulations scale with the number of particles N. 
Therefore it is possible to compare results from MD simulations with the 
solutions of the Vlasov equation, which describes the statistical properties of 
the particle dynamics in the N — > limit |H20) . As initial condition we consider 
a waterbag with total energy per particle e = 0.7 and average magnetization 
M = yjM% + Ml = 0.8, with M x = (cos(0)), M y = (sin(0)). This corresponds 
to a uniform distribution in the interval < 9 < 2.262 and —1.766 < p < 
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Tip — n x 


blA 3(U 1 1 


l A oyU 


256 


22 


24 


512 


29 


35 


1024 


35 


37 


2048 


39 


53 



Table 4 

Speedups for the HMF model for different grid resolutions. 

1.766. All integrations were performed with At = 0.1, a spatial grid < 9 < 2ir 
and momentum grid — p ma x < p < p m ax with p ma x = 3.531 and n p = rig = 
256, 512, 1024, 2048. Snapshots of the time evolution of the distribution are 
shown in Fig. [3] with a strong filamentation already present at t = 50. 

Figure [4] shows the graphic of the potential energy obtained from a MD simula- 
tion with N = 20, 000, 000 particle using a sympletic integrator with time step 
At = 0.1 [33], and the same curve obtained from our code with a numerical 
grid with n p = ng = 2048 points. Both simulations are in very good agreement 
up to roughly t ~ 70.0, after which the details of small fluctuations differ. This 
is due to the finite number of particles in the MD simulation and the strong 
formation of filaments by the natural evolution of the distribution function 
down to scales of the size of the grid spacing. Nevertheless the asymptotic 
behavior is the same in both cases. The entropy for different grid resolutions 
is shown in Fig. |5j Analogously to the sheet model, it has two plateaus, one 
before indentations scale reaches grid resolution, and a final plateau after the 
distribution is coarse grained. The speedups for the HMF model are shown in 
table |4j and are somewhat smaller than those for the sheet model. This comes 
from the fact that the serial code is well optimized by using eq. ([7]). Also the 
rate of successful coalesced memory access depends on the dynamics of the 
model, i. e. how far each grid point is moved by the advection. Figure [5] shows 
the entropy for the HMF model for different number of grid points n p = n$. 
The behavior is qualitatively the same as the previous case. 



4-1.2 The Ring model 

Computation of the mean-force field ^ for the pair interaction potential given 
in eq. ^ involves the computation of sin(# — 9') = sin9 cos 9' — cos 9 sin 9', 
and can therefore be optimized by simply computing and storing two arrays 
with the values of sin 9 and cos 9 at the spatial grid points at the beginning of 
the simulation. The CUDA function rsqrt for the inverse of the square-root in 
eq. ^ is exploited as it is cheaper than to computed both a square root and its 
inverse. The remaining steps are as described above. The speedups obtained 
are presented in table [5] The behavior with the number of grid points is similar 
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t=0.0 t=10.0 




Fig. 3. Snapshots of the distribution function of the HMF model for t = 0, 10, 50, 300. 
In each graphic p and 9 correspond to the horizontal and vertical axis respectively. 




Fig. 4. Mono-Log graphic of the potential energy for the HMF model obtained from 
Molecular Dynamics simulation (dots) with N = 20, 000, 000 particles and from the 
numerical solution of the Vlasov equation with n p = ng = 2048 (continuous line). 
The right panel is a zoom over a region of the left panel. 



to the two previous models. 
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Fig. 5. Entropy for the HMF model for different grid resolutions. 



n v = n x GTX 570 Ti GTX 590 



256 
512 
1024 
2048 



17 

29 
46 
57 



19 
33 
55 
73 



Table 5 

Speedups for the Ring model model. 
5 Concluding remarks 



We presented a GPU implementation using CUDA of a parallel numeric solver 
for the Vlasov equation on a GPU, based on a time-split scheme with a mod- 
ified cubic spline interpolation for both the spatial and momentum direction 
in phase space. The interpolation relies on a finite-difference scheme to ac- 
curately determine the second order derivatives required by the cubic spline 
interpolation in such a a way than only a small number of neighboring points 
is required, leading to a faster and simpler parallel implementation. Coalesced 
access to global memory in the GPU is ensured by performing a transpose 
of the distribution function and its second derivatives when performing the 
advection in the spatial direction. Implementations for three different one- 
dimensional long-range interacting models were presented with a discussion 
of accuracy and speedups of the simulations. The Vlasov dynamics leads to 
the formation of indentations in a scale which becomes smaller with time, 
and due to the finite grid accuracy, information loss ensues after some time, 
leading to a coarse-grained distribution, and an increase in entropy. Before 
the scale of indentation reaches grid accuracy the entropy is conserved by 
our approach. The same occurs after the distribution has been coarse-grained. 
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Higher order time split schemes and a filtering procedure can also be imple- 
mented if required [31 J . Although there is certainly room for improvements 
in our algorithm, the speedups obtained allow to conclude that the present 
parallel implementation is a useful tool in the ongoing investigations on open 
problems for long-range interacting systems. 
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